{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adding a New Dataset\n",
    "\n",
    "There exist two different options to use a different dataset within the `neuralHydrology` library.\n",
    "\n",
    "1. Preprocess your data to use the `GenericDataset` in `neuralhydrology.datasetzoo.genericdataset`.\n",
    "2. Implement a new dataset class, inheriting from `BaseDataset` in `neuralhydrology.datasetzoo.basedataset`.\n",
    "\n",
    "Using the `GenericDataset` is recommended and does not require you to add/change a single line of code, while writing a new dataset gives you more freedom to do whatever you want.\n",
    "\n",
    "## Using the GenericDataset\n",
    "\n",
    "With the release of version 0.9.6-beta, we added a `GenericDataset`. This class can be used with any data, as long as the data is preprocessed in the following way:\n",
    "\n",
    "- The data directory (config argument `data_dir`) must contain a folder 'time_series' and (if static attributes are used) a folder 'attributes'.\n",
    "- The folder 'time_series' contains one netcdf file (.nc or .nc4) per basin, named '<basin_id\\>.nc/nc4'. The netcdf file has to have one coordinate called `date`, containing the datetime index.\n",
    "- The folder 'attributes' contains one or more comma-separated file (.csv) with static attributes, indexed by basin id. Attributes files can be divided into groups of basins or groups of features (but not both).\n",
    "\n",
    "If you prepare your data set following these guidelines, you can simply set the config argument `dataset` to `generic` and set the `data_dir` to the path of your preprocessed data directory.\n",
    "\n",
    "## Adding a Dataset Class\n",
    "\n",
    "The rest of this tutorial will show you how to add a new dataset class to the `neuralhydrology.datasetzoo`. As an example, we will use the [CAMELS-CL](https://hess.copernicus.org/articles/22/5817/2018/) dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "from typing import List, Dict, Union\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import xarray\n",
    "from tqdm import tqdm\n",
    "\n",
    "from neuralhydrology.datasetzoo.basedataset import BaseDataset\n",
    "from neuralhydrology.utils.config import Config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Template\n",
    "Every dataset has its own file in `neuralhydrology.datasetzoo` and follows a common template. The template can be found [here](https://github.com/neuralhydrology/neuralhydrology/blob/master/neuralhydrology/datasetzoo/template.py). \n",
    "\n",
    "The most important points are:\n",
    "- All dataset classes have to inherit from `BaseDataset` implemented in `neuralhydrology.datasetzoo.basedataset`.\n",
    "- All dataset classes have to accept the same inputs upon initialization (see below)\n",
    "- Within each dataset class, you have to implement two methods:\n",
    "  - `_load_basin_data()`: This method loads the time series data for a single basin of the dataset (e.g. meteorological forcing data and streamflow) into a time-indexed pd.DataFrame.\n",
    "  - `_load_attributes()`: This method loads the catchment attributes for all basins in the dataset and returns a basin-indexed pd.DataFrame with attributes as columns.\n",
    "  \n",
    "\n",
    "`BaseDataset` is a map-style [Pytorch Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) that implements the core logic for all data sets. It takes care of multiple temporal resolutions, data fusion, normalizations, sanity checks, etc. and also implements the required methods `__len__` (returns number of total training samples) and `__getitem__` (returns a single training sample for a given index), which PyTorch data loaders use, e.g., to create mini-batches for training. However, all of this is not important if you just want to add another dataset to the `neuralHydrology` library.\n",
    "\n",
    "### Preprocessing CAMELS-CL\n",
    "Because the CAMELS-CL dataset comes in a rather unusual file structure, we added a function to create per-basin csv files with all timeseries features. You can find the function `preprocess_camels_cl_dataset` in `neuralhydrology.datasetzoo.camelscl`, which will create a subfolder called `preprocessed` containing the per-basin files. For the remainder of this tutorial, we assume that this folder and the per-basin csv files exist.\n",
    "\n",
    "### Class skeleton\n",
    "For the sake of this tutorial, we will omit doc-strings. However, when adding your dataset class we highly encourage you to add extensive doc-strings, as we did for all dataset classes in this package. We use [Python type annotations](https://docs.python.org/3/library/typing.html) everywhere, which facilitates code development with any modern IDE as well as makes it easier to understand what is happening inside a function or class.\n",
    "\n",
    "The class skeleton looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CamelsCL(BaseDataset):\n",
    "    \n",
    "    def __init__(self,\n",
    "                 cfg: Config,\n",
    "                 is_train: bool,\n",
    "                 period: str,\n",
    "                 basin: str = None,\n",
    "                 additional_features: List[Dict[str, pd.DataFrame]] = [],\n",
    "                 id_to_int: Dict[str, int] = {},\n",
    "                 scaler: Dict[str, Union[pd.Series, xarray.DataArray]] = {}):\n",
    "        \n",
    "        # Initialize `BaseDataset` class\n",
    "        super(CamelsCL, self).__init__(cfg=cfg,\n",
    "                                       is_train=is_train,\n",
    "                                       period=period,\n",
    "                                       basin=basin,\n",
    "                                       additional_features=additional_features,\n",
    "                                       id_to_int=id_to_int,\n",
    "                                       scaler=scaler)\n",
    "\n",
    "    def _load_basin_data(self, basin: str) -> pd.DataFrame:\n",
    "        \"\"\"Load timeseries data of one specific basin\"\"\"\n",
    "        raise NotImplementedError\n",
    "\n",
    "    def _load_attributes(self) -> pd.DataFrame:\n",
    "        \"\"\"Load catchment attributes\"\"\"\n",
    "        raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data loading functions\n",
    "\n",
    "For all datasets, we implemented the actual data loading (e.g., from the txt or csv files) in separate functions outside of the class so that these functions are usable everywhere. This is useful for example when you want to inspect or visualize the discharge of a particular basin or do anything else with the basin data. These functions are implemented within the same file (since they are specific to each data set) and we use those functions from within the class methods.\n",
    "\n",
    "So let's start by implementing a function that reads a single basin file of time series data for a given basin identifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_camels_cl_timeseries(data_dir: Path, basin: str) -> pd.DataFrame:\n",
    "    preprocessed_dir = data_dir / \"preprocessed\"\n",
    "    \n",
    "    # make sure the CAMELS-CL data was already preprocessed and per-basin files exist.\n",
    "    if not preprocessed_dir.is_dir():\n",
    "        msg = [\n",
    "            f\"No preprocessed data directory found at {preprocessed_dir}. Use preprocessed_camels_cl_dataset \", \n",
    "             \"in neuralhydrology.datasetzoo.camelscl to preprocess the CAMELS CL data set once into \",\n",
    "             \"per-basin files.\"\n",
    "        ]\n",
    "        raise FileNotFoundError(\"\".join(msg))\n",
    "        \n",
    "    # load the data for the specific basin into a time-indexed dataframe\n",
    "    basin_file = preprocessed_dir / f\"{basin}.csv\"\n",
    "    df = pd.read_csv(basin_file, index_col='date', parse_dates=['date'])\n",
    "    return df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most of this should be easy to follow. First we check that the data was already preprocessed and if it wasn't, we throw an appropriate error message. Then we proceed to load the data into a pd.DataFrame and we make sure that the index is converted into a datetime format.\n",
    "\n",
    "Next, we need a function to load the attributes, which are stored in a file called `1_CAMELScl_attributes.txt`. We assume that this file exist in the root directory of the dataset (such information is useful to add to the docstring!). The dataframe that this function has to return must be basin-indexed with attributes as columns. Furthermore, we accept an optional argument `basins`, which is a list of strings. This list can specify basins of interest and if passed, we only return the attributes for said basins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_camels_cl_attributes(data_dir: Path, basins: List[str] = []) -> pd.DataFrame:\n",
    "    \n",
    "    # load attributes into basin-indexed dataframe\n",
    "    attributes_file = data_dir / '1_CAMELScl_attributes.txt'\n",
    "    df = pd.read_csv(attributes_file, sep=\"\\t\", index_col=\"gauge_id\").transpose()\n",
    "\n",
    "    # convert all columns, where possible, to numeric\n",
    "    df = df.apply(pd.to_numeric, errors='ignore')\n",
    "\n",
    "    # convert the two columns specifying record period start and end to datetime format\n",
    "    df[\"record_period_start\"] = pd.to_datetime(df[\"record_period_start\"])\n",
    "    df[\"record_period_end\"] = pd.to_datetime(df[\"record_period_end\"])\n",
    "\n",
    "    if basins:\n",
    "        if any(b not in df.index for b in basins):\n",
    "            raise ValueError('Some basins are missing static attributes.')\n",
    "        df = df.loc[basins]\n",
    "\n",
    "    return df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Putting everything together\n",
    "\n",
    "Now we have all required pieces and can finish the dataset class. Notice that we have access to all class attributes from the parent class in all methods (such as the config, which is stored in `self.cfg`). In the `_load_attributes` method, we simply defer to the attribute loading function we implemented above. The BaseDataset will take care of removing all attributes that are not specified as input features. It will also check for missing attributes in the `BaseDataset`, so you don't have to take care of this here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CamelsCL(BaseDataset):\n",
    "    \n",
    "    def __init__(self,\n",
    "                 cfg: Config,\n",
    "                 is_train: bool,\n",
    "                 period: str,\n",
    "                 basin: str = None,\n",
    "                 additional_features: List[Dict[str, pd.DataFrame]] = [],\n",
    "                 id_to_int: Dict[str, int] = {},\n",
    "                 scaler: Dict[str, Union[pd.Series, xarray.DataArray]] = {}):\n",
    "        \n",
    "        # Initialize `BaseDataset` class\n",
    "        super(CamelsCL, self).__init__(cfg=cfg,\n",
    "                                       is_train=is_train,\n",
    "                                       period=period,\n",
    "                                       basin=basin,\n",
    "                                       additional_features=additional_features,\n",
    "                                       id_to_int=id_to_int,\n",
    "                                       scaler=scaler)\n",
    "\n",
    "    def _load_basin_data(self, basin: str) -> pd.DataFrame:\n",
    "        \"\"\"Load timeseries data of one specific basin\"\"\"\n",
    "        return load_camels_cl_timeseries(data_dir=self.cfg.data_dir, basin=basin)\n",
    "\n",
    "    def _load_attributes(self) -> pd.DataFrame:\n",
    "        \"\"\"Load catchment attributes\"\"\"\n",
    "        return load_camels_cl_attributes(self.cfg.data_dir, basins=self.basins)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Integrating the dataset class into neuralHydrology\n",
    "\n",
    "With these few lines of code, you are ready to use a new dataset within the `neuralHydrology` framework. The only thing missing is to link the new dataset in the `get_dataset()` function, implemented in `neuralhydrology.datasetzoo.__init__.py`. Again, we removed the doc-string for brevity ([here](https://neuralhydrology.readthedocs.io/en/latest/api/neuralhydrology.datasetzoo.html#neuralhydrology.datasetzoo.get_dataset) you can find the documentation), but the code of this function is as simple as this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from neuralhydrology.datasetzoo.basedataset import BaseDataset\n",
    "from neuralhydrology.datasetzoo.camelscl import CamelsCL\n",
    "from neuralhydrology.datasetzoo.camelsgb import CamelsGB\n",
    "from neuralhydrology.datasetzoo.camelsus import CamelsUS\n",
    "from neuralhydrology.datasetzoo.hourlycamelsus import HourlyCamelsUS\n",
    "from neuralhydrology.utils.config import Config\n",
    "\n",
    "\n",
    "def get_dataset(cfg: Config,\n",
    "                is_train: bool,\n",
    "                period: str,\n",
    "                basin: str = None,\n",
    "                additional_features: list = [],\n",
    "                id_to_int: dict = {},\n",
    "                scaler: dict = {}) -> BaseDataset:\n",
    "    \n",
    "    # check config argument and select appropriate data set class\n",
    "    if cfg.dataset == \"camels_us\":\n",
    "        Dataset = CamelsUS\n",
    "    elif cfg.dataset == \"camels_gb\":\n",
    "        Dataset = CamelsGB\n",
    "    elif cfg.dataset == \"hourly_camels_us\":\n",
    "        Dataset = HourlyCamelsUS\n",
    "    elif cfg.dataset == \"camels_cl\":\n",
    "        Dataset = CamelsCL\n",
    "    else:\n",
    "        raise NotImplementedError(f\"No dataset class implemented for dataset {cfg.dataset}\")\n",
    "    \n",
    "    # initialize dataset\n",
    "    ds = Dataset(cfg=cfg,\n",
    "                 is_train=is_train,\n",
    "                 period=period,\n",
    "                 basin=basin,\n",
    "                 additional_features=additional_features,\n",
    "                 id_to_int=id_to_int,\n",
    "                 scaler=scaler)\n",
    "    return ds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, by settig `dataset: camels_cl` in the config file, you are able to train a model on the CAMELS-CL data set. \n",
    "\n",
    "The available time series features are:\n",
    "- tmax_cr2met\n",
    "- precip_mswep\n",
    "- streamflow_m3s\n",
    "- tmin_cr2met\n",
    "- pet_8d_modis\n",
    "- precip_chirps\n",
    "- pet_hargreaves\n",
    "- streamflow_mm\n",
    "- precip_cr2met\n",
    "- swe\n",
    "- tmean_cr2met\n",
    "- precip_tmpa\n",
    "\n",
    "For a list of available attributes, look at the `1_CAMELScl_attributes.txt` file or make use of the above implemented function to load the attributes into a pd.DataFrame."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
